Rephrase, Augment, Reason: Visual Grounding of Questions for Vision-Language Models
An increasing number of vision-language tasks can be handled with little to
no training, i.e., in a zero- and few-shot manner, by marrying large language
models (LLMs) to vision encoders, resulting in large vision-language models
(LVLMs). While this has huge upsides, such as not requiring training data or
custom architectures, how an input is presented to an LVLM can have a major
impact on zero-shot model performance. In particular, inputs phrased in an
underspecified way can result in incorrect answers due to factors like missing
visual information, complex implicit reasoning, or linguistic ambiguity.
Therefore, adding visually grounded information to the input as a preemptive
clarification should improve model performance by reducing underspecification,
e.g., by localizing objects and disambiguating references. Similarly, in the
VQA setting, changing the way questions are framed can make them easier for
models to answer. To this end, we present Rephrase, Augment and Reason
(RepARe), a gradient-free framework that extracts salient details about the
image using the underlying LVLM as a captioner and reasoner, in order to
propose modifications to the original question. We then use the LVLM's
confidence over a generated answer as an unsupervised scoring function to
select the rephrased question most likely to improve zero-shot performance.
Focusing on two visual question answering tasks, we show that RepARe can result
in a 3.85% (absolute) increase in zero-shot performance on VQAv2 and a 6.41
percentage point increase on A-OKVQA. Additionally, we find that using gold
answers for oracle question candidate selection achieves a substantial gain in
VQA accuracy of up to 14.41%. Through extensive analysis, we demonstrate that outputs from
RepARe increase syntactic complexity, and effectively utilize vision-language
interaction and the frozen language model in LVLMs.
Comment: 22 pages, 4 figures, Code: https://github.com/archiki/RepAR
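To make the selection step concrete, below is a minimal Python sketch of RepARe-style question rephrasing and confidence-based candidate selection. The wrappers `lvlm_generate` and `lvlm_answer_confidence` are hypothetical placeholders for calls to a frozen LVLM; the actual prompts and pipeline in the paper and repository may differ.

```python
from typing import List, Tuple

def lvlm_generate(image, prompt: str, n: int = 1) -> List[str]:
    """Hypothetical wrapper: sample n generations from a frozen LVLM."""
    raise NotImplementedError

def lvlm_answer_confidence(image, question: str) -> Tuple[str, float]:
    """Hypothetical wrapper: greedy answer and its length-normalized log-probability."""
    raise NotImplementedError

def repare(image, question: str, n_candidates: int = 5) -> str:
    # 1) Extract salient visual details using the LVLM itself as captioner/reasoner.
    caption = lvlm_generate(image, "Describe the image in detail.")[0]
    details = lvlm_generate(
        image, f"Question: {question}\nWhich visual details are needed to answer it?"
    )[0]

    # 2) Propose rephrased/augmented question candidates grounded in those details.
    rewrite_prompt = (
        f"Caption: {caption}\nDetails: {details}\n"
        f"Rewrite this question so it is fully specified: {question}"
    )
    candidates = [question] + lvlm_generate(image, rewrite_prompt, n=n_candidates)

    # 3) Unsupervised selection: keep the candidate whose generated answer
    #    the LVLM is most confident about.
    scored = [(lvlm_answer_confidence(image, q)[1], q) for q in candidates]
    _, best_question = max(scored)
    return best_question
```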
ReCEval: Evaluating Reasoning Chains via Correctness and Informativeness
Multi-step reasoning ability is fundamental to many natural language tasks,
yet it is unclear what constitutes a good reasoning chain and how to evaluate
them. Most existing methods focus solely on whether the reasoning chain leads
to the correct conclusion, but this answer-oriented view may confound the
quality of reasoning with other spurious shortcuts to predict the answer. To
bridge this gap, we evaluate reasoning chains by viewing them as informal
proofs that derive the final answer. Specifically, we propose ReCEval
(Reasoning Chain Evaluation), a framework that evaluates reasoning chains
through two key properties: (1) correctness, i.e., each step makes a valid
inference based on the information contained within the step, preceding steps,
and input context, and (2) informativeness, i.e., each step provides new
information that is helpful towards deriving the generated answer. We implement
ReCEval using natural language inference models and information-theoretic
measures. On multiple datasets, ReCEval is highly effective in identifying
different types of errors, resulting in notable improvements compared to prior
methods. We demonstrate that our informativeness metric captures the expected
flow of information in high-quality reasoning chains and we also analyze the
impact of previous steps on evaluating correctness and informativeness.
Finally, we show that scoring reasoning chains based on ReCEval can improve
downstream performance of reasoning tasks. Our code is publicly available at:
https://github.com/archiki/ReCEval
Comment: 20 pages, 3 figures
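As an illustration of the two properties, the sketch below scores a reasoning chain step by step: correctness via an NLI entailment probability of each step given the input context and preceding steps, and informativeness via the gain in answer log-probability from adding the step. `nli_entail_prob` and `answer_logprob` are hypothetical wrappers, and the min-over-steps aggregation is a simplification; the released implementation may use different models and metrics.

```python
from typing import List, Tuple

def nli_entail_prob(premise: str, hypothesis: str) -> float:
    """Hypothetical wrapper: entailment probability from an NLI model."""
    raise NotImplementedError

def answer_logprob(context: str, answer: str) -> float:
    """Hypothetical wrapper: log P(answer | context) under a language model."""
    raise NotImplementedError

def step_correctness(context: str, steps: List[str], i: int) -> float:
    # Correctness: step i should follow from the input context plus preceding steps.
    premise = " ".join([context] + steps[:i])
    return nli_entail_prob(premise, steps[i])

def step_informativeness(context: str, steps: List[str], i: int, answer: str) -> float:
    # Informativeness: how much does adding step i increase the probability of
    # the final answer, relative to the chain without that step?
    with_step = " ".join([context] + steps[: i + 1])
    without_step = " ".join([context] + steps[:i])
    return answer_logprob(with_step, answer) - answer_logprob(without_step, answer)

def receval(context: str, steps: List[str], answer: str) -> Tuple[float, float]:
    # Aggregate with min so that a single invalid or uninformative step
    # penalizes the whole chain.
    correctness = min(step_correctness(context, steps, i) for i in range(len(steps)))
    informativeness = min(
        step_informativeness(context, steps, i, answer) for i in range(len(steps))
    )
    return correctness, informativeness
```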
ADaPT: As-Needed Decomposition and Planning with Language Models
Large Language Models (LLMs) are increasingly being used for interactive
decision-making tasks requiring planning and adapting to the environment.
Recent works employ LLMs-as-agents in two broad ways: iteratively determining
the next action (iterative executors) or generating plans and executing
sub-tasks using LLMs (plan-and-execute). However, these methods struggle with
task complexity, as the inability to execute any sub-task may lead to task
failure. To address these shortcomings, we introduce As-Needed Decomposition
and Planning for complex Tasks (ADaPT), an approach that explicitly plans and
decomposes complex sub-tasks as-needed, i.e., when the LLM is unable to execute
them. ADaPT recursively decomposes sub-tasks to adapt to both task complexity
and LLM capability. Our results demonstrate that ADaPT substantially
outperforms established strong baselines, achieving success rates up to 28.3%
higher in ALFWorld, 27% in WebShop, and 33% in TextCraft -- a novel
compositional dataset that we introduce. Through extensive analysis, we
illustrate the importance of multilevel decomposition and establish that ADaPT
dynamically adjusts to the capabilities of the executor LLM as well as to task
complexity.
Comment: Project Page: https://allenai.github.io/adaptll
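The core control flow can be summarized in a short recursive sketch: try to execute a task directly, and decompose it into sub-tasks only when execution fails. `llm_execute` and `llm_plan` are hypothetical wrappers around the executor and planner LLMs, and the all-sub-tasks-must-succeed logic is a simplification of the planner outputs described in the paper.

```python
from typing import List

def llm_execute(task: str, env) -> bool:
    """Hypothetical executor: attempt the task in the environment, return success."""
    raise NotImplementedError

def llm_plan(task: str) -> List[str]:
    """Hypothetical planner: decompose the task into an ordered list of sub-tasks."""
    raise NotImplementedError

def adapt(task: str, env, max_depth: int = 3) -> bool:
    # Try to execute the task directly with the LLM executor first.
    if llm_execute(task, env):
        return True
    # Decompose only as needed, i.e., when direct execution fails,
    # and only within a recursion budget.
    if max_depth == 0:
        return False
    # Solve each planned sub-task recursively; here the task succeeds only if
    # every sub-task succeeds (a simplification of the paper's planner logic).
    return all(adapt(sub, env, max_depth - 1) for sub in llm_plan(task))
```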
GrIPS: Gradient-free, Edit-based Instruction Search for Prompting Large Language Models
Providing natural language instructions in prompts is a useful new paradigm
for improving task performance of large language models in a zero-shot setting.
Recent work has aimed to improve such prompts via manual rewriting or
gradient-based tuning. However, manual rewriting is time-consuming and requires
subjective interpretation, while gradient-based tuning can be extremely
computationally demanding for large models and requires full access to model
weights, which may not be available for API-based models. In this work, we
introduce Gradient-free Instructional Prompt Search (GrIPS), a gradient-free,
edit-based search approach for improving task instructions for large language
models. GrIPS takes in instructions designed for humans and automatically
returns an improved, edited prompt, while allowing for API-based tuning. The
instructions in our search are iteratively edited using four operations
(delete, add, swap, paraphrase) on text at the phrase-level. With InstructGPT
models, GrIPS improves the average task performance by up to 4.30 percentage
points on eight classification tasks from the Natural-Instructions dataset. We
see improvements for both instruction-only prompts and for k-shot
example+instruction prompts. Notably, GrIPS outperforms manual rewriting
following the guidelines in Mishra et al. (2022) and also outperforms purely
example-based prompts while controlling for the available compute and data
budget. Lastly, we provide qualitative analysis of the edited instructions
across several scales of GPT models. Our code is available at:
https://github.com/archiki/GrIP
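The search procedure itself can be viewed as greedy hill-climbing over phrase-level edits, sketched below. `split_into_phrases`, `paraphrase`, and `score_on_dev_set` are hypothetical helpers standing in for the phrase segmentation, paraphrasing, and dev-set scoring steps described in the paper; the released code may organize the search differently.

```python
import random
from typing import List

def split_into_phrases(instruction: str) -> List[str]:
    """Hypothetical helper: segment the instruction into phrases."""
    raise NotImplementedError

def paraphrase(phrase: str) -> str:
    """Hypothetical helper: paraphrase a single phrase."""
    raise NotImplementedError

def score_on_dev_set(instruction: str) -> float:
    """Hypothetical helper: task score of the LLM when prompted with this instruction."""
    raise NotImplementedError

def edit(instruction: str) -> str:
    phrases = split_into_phrases(instruction)
    if not phrases:
        return instruction
    op = random.choice(["delete", "add", "swap", "paraphrase"])
    i = random.randrange(len(phrases))
    if op == "delete" and len(phrases) > 1:
        phrases.pop(i)
    elif op == "add":
        # Simplification: duplicate another phrase; the paper's add operation
        # restores previously deleted phrases.
        phrases.insert(i, random.choice(phrases))
    elif op == "swap":
        j = random.randrange(len(phrases))
        phrases[i], phrases[j] = phrases[j], phrases[i]
    elif op == "paraphrase":
        phrases[i] = paraphrase(phrases[i])
    return " ".join(phrases)

def grips(instruction: str, iterations: int = 10, candidates_per_step: int = 8) -> str:
    # Greedy hill-climbing: keep an edited instruction only if it scores higher
    # on a small development set.
    best, best_score = instruction, score_on_dev_set(instruction)
    for _ in range(iterations):
        candidates = [edit(best) for _ in range(candidates_per_step)]
        score, candidate = max((score_on_dev_set(c), c) for c in candidates)
        if score > best_score:
            best, best_score = candidate, score
    return best
```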